Introduction to Gadfly
Gadfly is a plotting and visualization library written in Julia. It is based largely on Hadley Wickham's ggplot2 for R and Leland Wilkinson's book The Grammar of Graphics. A similar package in Python is plotnine (https://realpython.com/ggplot-python/).
Some of its features are:
- Renders publication quality graphics to SVG, PNG, Postscript, and PDF
- Intuitive and consistent plotting interface
- Works with Jupyter notebooks via IJulia out of the box
- Tight integration with DataFrames.jl
- Interactivity like panning, zooming, toggling powered by Snap.svg
- Supports a large number of common plot types
Additional Recommended Resources:
Gadfly Documentation: http://gadflyjl.org/stable/
using CSV
using DataFrames
using Statistics
using FreqTables
using StatsBase
using DataFramesMeta
using Gadfly
using NamedArrays ##For named arrays
ENV["COLUMNS"] = 1000
ENV["LINES"] = 20
This dataset is a slightly modified version of the dataset provided in the StatLib library. In line with the use by Ross Quinlan (1993) in predicting the attribute "mpg", 8 of the original instances were removed because they had unknown values for the "mpg" attribute. The original dataset is available in the file "auto-mpg.data-original".
"The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes." (Quinlan, 1993)
Attribute Information:
Read the data from the url source (fixed-width formatted lines)
homedir()
pwd()
#autos_df = pd.read_csv('./data/autos_df.csv',
#index_col=['car_name'])
#autos_df.head()
autos_df = CSV.read("./data/autos_df.csv", DataFrame)
#autos_df.info()
eltype.(eachcol(autos_df)) ## column element types (replaces the deprecated eltypes)
autos_df[:,:origin] = convert.(Float64, autos_df[:,:origin])
autos_df[:,:origin] = convert.(Int64, autos_df[:,:origin])
autos_df[:, :s_origin] = string.(autos_df[:,:origin])
autos_df[:, :origin] = parse.(Int64, autos_df[:,:s_origin]);
autos_df[!, :origin] = parse.(Float64, autos_df[:,:s_origin])
first(autos_df, 2) ## first() replaces the deprecated head()
Get the mean, median, mode etc
#autos_df.describe()
describe(autos_df)
horsepower seems to have missing values
#autos_df["horsepower"].isnull().values.any()
#autos_df[autos_df.horsepower.isnull()]
any(ismissing.(autos_df.horsepower))
#autos_df = autos_df.dropna()
autos_df = dropmissing(autos_df,:)
#autos_df[["mpg", "displacement","horsepower","weight","acceleration"]].describe()
describe(autos_df[:,["mpg", "displacement","horsepower","weight","acceleration"]])
What is the average miles per gallon for cars of each origin?
#autos_df.groupby('origin')['mpg'].mean()
combine(groupby(autos_df,:s_origin), :mpg .=> [mean])
Using Gadfly for different plots.
A histogram is a tool for visualizing one-dimensional data that is continuous in nature. Given a collection of values of a single random variable:
- Choose an interval width (the bins into which the entire dataset is bucketed)
- Count the data points within each bin (the y axis represents the frequency count)
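The two steps above can be sketched in plain Julia, without any plotting. This is only an illustration of equal-width binning, not how Gadfly computes its bins; the function name bin_counts and the sample values are made up for the example, and it assumes the data contains at least two distinct values.

```julia
# Illustrative sketch of equal-width histogram binning (hypothetical helper,
# not part of Gadfly). Assumes `values` has at least two distinct values.
function bin_counts(values::AbstractVector{<:Real}, nbins::Integer)
    lo, hi = extrema(values)
    width = (hi - lo) / nbins              # step 1: choose the interval width
    counts = zeros(Int, nbins)
    for v in values
        # step 2: find the bin that contains v; the maximum goes into the last bin
        idx = min(floor(Int, (v - lo) / width) + 1, nbins)
        counts[idx] += 1
    end
    edges = [lo + i * width for i in 0:nbins]
    return edges, counts
end

# Made-up mpg-like values, used only for this example
mpg_sample = [18.0, 15.0, 16.0, 24.0, 22.0, 31.0, 35.0, 27.0, 26.0, 14.0]
edges, counts = bin_counts(mpg_sample, 3)
println(edges)
println(counts)
```

The counts are what a histogram draws on the y axis; changing the number of bins changes the width and therefore the shape of the picture.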
The plot function with Geom.histogram from the Gadfly package renders a histogram.
autos_df[autos_df.s_origin .== "1",:]
#sn.distplot(autos_df[autos_df.origin == 1]['mpg'], label = 'American')
#plt.legend()
set_default_plot_size(24cm, 12cm)
origin_filter = autos_df[autos_df.s_origin .== "1",:]
hp1 = plot(origin_filter,
x="mpg",
Geom.histogram,
Guide.title("Cars of American Origin - 1"))
hp2 = plot(origin_filter,
x="mpg",
Geom.histogram(bincount=30),
Guide.title("Cars of American origin - 2"))
Plot the distribution of miles per gallon for cars of all origins
#sn.distplot(autos_df[autos_df.origin == 1]['mpg'],label = 'American', hist=True)
#sn.distplot(autos_df[autos_df.origin == 2]['mpg'], label = 'European',hist = True)
#sn.distplot(autos_df[autos_df.origin == 3]['mpg'], label = 'Japanese',hist = True)
#plt.legend()
##If plotting mpg for cars from american, european and japanese make.
hp3 = plot(autos_df,
x=:mpg,
color=:s_origin,
Geom.histogram(position=:identity, bincount=40),
Scale.color_discrete_manual("skyblue","red","green"),
Theme(alphas=[0.5], discrete_highlight_color=identity),
Guide.title("Cars with all origins"))
hstack(hp1, hp2, hp3)
- Plot the histogram along with density
#Histogram with density plot
hp4 = plot(autos_df,
x=:mpg,
color=:s_origin,
Geom.density(),
Geom.histogram(position=:identity, density = true, bincount=40),
Guide.title("Histogram and Density Plot-1"),
Scale.color_discrete_manual("skyblue","red","green"),
Theme(alphas=[0.5], discrete_highlight_color=identity))
#The other way to write hp4
hp5 = plot(autos_df,
x=:mpg,
color=:s_origin,
layer(Geom.density(), Geom.histogram(position=:identity, density = true, bincount=40)),
Guide.title("Histogram and Density Plot-1"),
Scale.color_discrete_manual("skyblue","red","green"),
Theme(alphas=[0.5], discrete_highlight_color=identity))
gridstack([hp4 hp5])
## Density plot using Geom.polygon
hp6 = plot(autos_df,
x=:mpg,
color=:s_origin,
Stat.density(),
Geom.polygon(fill=true, preserve_order=true),
Geom.histogram(position=:identity, density = true, bincount=40),
Guide.title("Histogram and Density Plot-1"),
Scale.color_discrete_manual("skyblue","red","green"),
Theme(alphas=[0.5], discrete_highlight_color=identity))
##The other way to write hp6
hp7 = plot(autos_df,
x=:mpg,
color=:s_origin,
layer(Geom.density(), Geom.histogram(position=:identity, density = true, bincount=40)),
##Comment out the line above and uncomment the two lines below to see the change.
#layer(Stat.density(), Geom.polygon(fill = true, preserve_order = true)),
#layer(Geom.histogram(position=:identity, density = true, bincount=40)),
Guide.title("Histogram and Density Plot-1"),
Scale.color_discrete_manual("skyblue","red","green"),
Theme(alphas=[0.5], discrete_highlight_color=identity))
gridstack([hp6 hp7])
#sn.distplot(autos_df[autos_df.origin == 3]['mpg'],label = 'Japanese',hist=False)
#plt.legend()
#sn.distplot(autos_df['mpg'],hist = False)
Plot the distribution of horsepower for cars with American origin and Japanese origin
A kernel density estimation (KDE) plot shows a smooth curve that approximates the shape of the distribution. It is a nonparametric estimate of density, where inferences about the population are made from a finite data sample.
Parametric Data/Test: When the data is assumed to have been drawn from a particular distribution and some parametric test can be applied to it
Non-Parametric Data/Test: When we have no knowledge about the population and the underlying distribution
What is a kernel?
Kernel: A kernel is a special type of probability density function (PDF) with the added property that it must be even. Thus, a kernel is a function with the following properties:
- non-negative
- real-valued
- even
- its definite integral over its support set must equal 1
Some common PDFs are kernels; they include the Uniform(-1,1) and standard normal distributions.
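The properties listed above can be checked numerically for the two kernels just mentioned. The following is a rough sketch in plain Julia; the helper names (uniform_kernel, normal_kernel, area, is_even) and the grid are assumptions made for this example, and the integral is only approximated by a Riemann sum.

```julia
# Uniform(-1, 1) density and standard normal density, written out by hand
uniform_kernel(u) = abs(u) <= 1 ? 0.5 : 0.0
normal_kernel(u) = exp(-u^2 / 2) / sqrt(2π)

# Evaluation grid, symmetric about zero (an arbitrary choice for this sketch)
grid = range(-8.0, 8.0; length=16001)

# Riemann-sum approximation of the integral of kernel k over the grid
area(k) = sum(k.(grid)) * step(grid)

# Evenness check: k(u) should match k(-u) at every grid point
is_even(k) = all(k(u) ≈ k(-u) for u in grid)

# Non-negativity check on the grid
is_nonnegative(k) = all(k.(grid) .>= 0)

for (name, k) in (("uniform", uniform_kernel), ("normal", normal_kernel))
    println(name, ": area ≈ ", round(area(k), digits=3),
            ", even = ", is_even(k),
            ", non-negative = ", is_nonnegative(k))
end
```

Both areas come out close to 1, up to the discretization error of the grid.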
What is kernel density estimation?
Kernel density estimation is a non-parametric method of estimating the probability density function (PDF) of a continuous random variable. It is non-parametric because it does not assume any underlying distribution for the variable. Essentially, a kernel function is created at every datum, with the datum at its centre; this ensures that the kernel is symmetric about the datum. The PDF is then estimated by adding all of these kernel functions and dividing by the number of data points, so that it satisfies the two properties of a PDF:
- Every possible value of the PDF (i.e. the function, f(x)), is non-negative.
- The definite integral of the PDF over its support set equals 1.
Steps in estimating kernel density:
- Each observation is first replaced with a normal (Gaussian) curve centered at that value.
- These curves are summed to compute the value of the density at each point in the support grid. The resulting curve is then normalized so that the area under it is equal to 1
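The steps above can be sketched in a few lines of plain Julia. This is a hand-rolled illustration with a Gaussian kernel, not the estimator Gadfly uses internally; the function name kde_at, the sample values, and the hand-picked bandwidth h are all assumptions made for the example.

```julia
# Standard normal density, used as the kernel
gaussian_kernel(u) = exp(-u^2 / 2) / sqrt(2π)

# Density estimate at point x: one Gaussian curve per observation, summed,
# then divided by (number of observations * bandwidth) so the area is 1.
function kde_at(x::Real, data::AbstractVector{<:Real}, h::Real)
    return sum(gaussian_kernel((x - d) / h) for d in data) / (length(data) * h)
end

data = [18.0, 22.0, 24.0, 26.0, 31.0]   # made-up mpg-like values
h = 2.0                                  # bandwidth picked by hand for this sketch
grid = range(0.0, 50.0; length=1001)     # support grid
density = [kde_at(x, data, h) for x in grid]

# Riemann-sum check that the area under the estimated curve is close to 1
kde_area = sum(density) * step(grid)
println("area under the KDE curve ≈ ", round(kde_area, digits=4))
```

A larger bandwidth gives a smoother curve and a smaller one follows the individual observations more closely.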
More about KDE at:
#sn.distplot(autos_df[autos_df.origin == 1]['mpg'],hist=False, label = 'American')
#sn.distplot(autos_df[autos_df.origin == 2]['mpg'], hist = False, label = 'European')
#sn.distplot(autos_df[autos_df.origin == 3]['mpg'], hist = False, label = 'Japanese')
# Just the density plot
dp1 = plot(autos_df,
x=:mpg,
color=:s_origin,
layer(Stat.density(), Geom.polygon(fill=true, preserve_order=true)),
Scale.color_discrete_manual("skyblue","red","green"),
Theme(alphas=[0.5], discrete_highlight_color=identity))
#Density plot with central 90% confidence limit.
dp2 = plot(autos_df,
x=:mpg,
color=:s_origin,
#layer(Stat.density, Geom.polygon(fill=true, preserve_order=true), alpha=[0.4]),
layer(Geom.density()),
layer(Stat.quantile_bars(quantiles=[0.05, 0.95]), Geom.segment),
Guide.title("1st Density plot with 90% CI"),
Scale.color_discrete_manual("skyblue","red","green"),
Theme(alphas=[0.2], discrete_highlight_color=identity))
##The other way
dp3 = plot(autos_df,
x=:mpg,
color=:s_origin,
layer(Stat.density, Geom.polygon(fill=true, preserve_order=true) ), #alpha=[0.4]
layer(Stat.quantile_bars(quantiles=[0.05, 0.95]), Geom.segment),
Guide.title("2nd Density plot with 90% CI"),
Scale.color_discrete_manual("skyblue","red","green"),
Theme(alphas=[0.4], discrete_highlight_color=identity))
#hstack(dp1,dp2,dp3)
gridstack([dp1 dp2 dp3])
In the Seaborn package (Python), the barplot function has an estimator parameter that, by default, computes the average value of a numeric feature for each category.
#sn.barplot(y = 's_origin',
# x = 'cylinders',
# data = autos_df,
# )
If it is just a count of values w.r.t a variable:
plot(autos_df,
x = "cylinders",
Geom.bar,
Theme(bar_spacing = 4mm)
)
Here we have to compute the aggregate separately before calling the plot function.
To plot average miles per gallon for different cylinder types using DataFrames and Gadfly:
- Use the groupby method to group by cylinders and calculate the mean of mpg. Name this dataframe mpg_cylinder_df.
- Call plot() from Gadfly with Geom.bar to plot mpg_cylinder_df.
mpg_cylinder_df = combine(groupby(autos_df,[:cylinders]), :mpg .=> [mean])
bp1 = plot(mpg_cylinder_df,
x=:cylinders,
y=:mpg_mean,
Geom.bar(), #position=:dodge or :stack (but not needed for now.)
Guide.title("1st Bar plot"),
Theme(
bar_spacing=4mm,
key_position=:right),
Coord.cartesian(xmin=2, xmax=9))
- Use the groupby method to group by cylinders and origin and calculate the mean of mpg. Name this dataframe mpg_cylinders_origin_df.
- Call plot() from Gadfly with Geom.bar to plot mpg_cylinders_origin_df.
#sn.barplot(x = 'cylinders',
# y = 'mpg',
# hue = 'origin',
# data = autos_df)
mpg_cylinders_origin_df = combine(groupby(autos_df,[:cylinders,:s_origin]), :mpg .=> [mean])
mpg_cylinders_origin_df.label=string.(round.(Int, mpg_cylinders_origin_df.mpg_mean))
mpg_cylinders_origin_df
bp2 = plot(mpg_cylinders_origin_df,
x=:cylinders,
y=:mpg_mean,
color=:s_origin,
label=:label,Geom.label(position=:centered), Stat.dodge(position=:stack),
Geom.bar(position=:stack),
Guide.title("1st Bar plot"),
Scale.color_discrete_manual("skyblue","red","green"),
Theme(
bar_spacing=4mm,
key_position=:right),
Coord.cartesian(xmin=2, xmax=9))
Change the code above to make the bar chart vertically dodged.
A horizontal dodged chart can be plotted as below:
bp3 = plot(mpg_cylinders_origin_df,
y=:cylinders,
x=:mpg_mean,
color=:s_origin,
label=:label,Geom.label(position=:right), Stat.dodge( axis=:y),
Geom.bar(position=:dodge,orientation=:horizontal),
Guide.title("2nd Bar plot"),
Guide.yticks(orientation=:vertical),
Guide.ylabel("# Cylinders"),
Guide.xlabel("Avg miles per gallon"),
Scale.color_discrete_manual("skyblue","red","green"),
Theme(
bar_spacing=4mm,
key_position=:right),
Coord.cartesian(ymin=2, ymax=9, yflip=true))
- Use the groupby method to group by cylinders and s_origin and count the number of cars. Name this dataframe origin_cylinders_df.
- Call plot() from Gadfly with Geom.bar to plot origin_cylinders_df.
origin_cylinders_df = combine(groupby(autos_df,[:s_origin,:cylinders]), nrow .=> [:count])
bp4 = plot(origin_cylinders_df,
x=:cylinders,
y=:count,
color=:s_origin,
layer(
label=string.(origin_cylinders_df.count),
Geom.label(position=:centered),
Stat.dodge(position=:dodge)),
Geom.bar(position=:dodge),
Guide.title("Count of cars with cylinders and origin"),
Scale.color_discrete_manual("skyblue","red","green"),
Theme(
bar_spacing=4mm,
key_position=:right),
Coord.cartesian(xmin=2, xmax=9))
A heatmap can be used to visualize a square matrix, such as a correlation matrix. To visualize rectangular data, Gadfly has Geom.rectbin:
http://gadflyjl.org/stable/gallery/geometries/#[Geom.rect](@ref),-[Geom.rectbin](@ref)
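We will proceed with visualizing a correlation matrix. Pearson's correlation coefficient measures the strength of the linear relationship between two variables. Strictly speaking, the usual significance test for it assumes that each variable is normally distributed, though not necessarily zero-mean.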
- Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation.
- Correlations of -1 or +1 imply an exact linear relationship.
- Positive correlations imply that as x increases, so does y.
- Negative correlations imply that as x increases, y decreases.
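The properties above can be seen on a tiny example. The sketch below computes Pearson's coefficient by hand and compares it with cor from the Statistics standard library; the function name pearson and the toy vectors are made up for this illustration.

```julia
using Statistics

# Pearson's r written out: covariance divided by the product of the spreads
function pearson(x::AbstractVector{<:Real}, y::AbstractVector{<:Real})
    dx = x .- mean(x)
    dy = y .- mean(y)
    return sum(dx .* dy) / sqrt(sum(dx .^ 2) * sum(dy .^ 2))
end

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = 2 .* x .+ 1                    # exact increasing line
y_down = -3 .* x                      # exact decreasing line
y_mixed = [2.0, 1.0, 4.0, 3.0, 5.0]   # increasing trend with some scatter

println(pearson(x, y_up))      # close to +1
println(pearson(x, y_down))    # close to -1
println(pearson(x, y_mixed))   # strictly between 0 and 1
println(pearson(x, y_mixed) ≈ cor(x, y_mixed))   # agrees with Statistics.cor
```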
autos_cor_df = autos_df[:,["mpg",
"displacement",
"horsepower",
"weight",
"acceleration"]]
first(autos_cor_df, 6) ## first() replaces the deprecated head()
The pandas code below gives the correlation matrix in Python:
#auto_cor_df.corr()
cor(autos_df[:,"mpg"],autos_df[:,"horsepower"])
cor(Matrix(autos_cor_df))
cor_matrix = NamedArray( cor( Matrix(autos_cor_df) ) )
cor_matrix = NamedArray( cor( Matrix(autos_cor_df)),
([names(autos_cor_df);],[names(autos_cor_df);]),
("Rows", "Cols"))
A developer's way of doing the above. The code block below is not mine.
Source: https://discourse.julialang.org/t/first-impression-of-dataframes-jl/49753/5
struct NoPrint end; Base.show(::IO, ::NoPrint) = nothing
NamedArray([i > j ? cor(autos_cor_df[!, i], autos_cor_df[!, j]) :
NoPrint()
for i in 2:ncol(autos_cor_df),
j in 1:ncol(autos_cor_df)-1],
(names(autos_cor_df)[2:end], names(autos_cor_df)[1:end-1]))
End of the developer's code.
Proceeding with heatmap
Heatmap using seaborn in python.
#sn.heatmap(auto_clean_df.corr(),
# annot = True,
# cmap = sn.diverging_palette(250, 10, n = 25))
The basic heatmap using spy function from Gadfly
spy(cor_matrix)
cor_names= [names(cor_matrix,1);] ##could have written (cor_matrix,2) as well
spy(cor_matrix,
Scale.y_discrete(labels = i->cor_names[i]),
Scale.x_discrete(labels = i->cor_names[i]),
Scale.color_continuous(colormap=Scale.lab_gradient("red", "white", "green")))
A scatter plot is a cloud of points showing the joint distribution of two numerical variables, where each point represents an observation from the dataset. It helps to understand the relationship between two numerical variables.
#sn.jointplot(x = 'mpg',
# y = 'weight',
# hue = 's_origin',
# data = autos_df,
# kind = 'scatter'
# )
Gadfly.push_theme(:dark) ## switch to the dark theme
plot(autos_df, x="mpg", y="weight", color = "s_origin", Geom.point)
The problem with scatter plots is overplotting. When the dataset is huge, the dots tend to overlap and the graphic becomes unreadable and meaningless.
In one dimension, straight line segments are the only possible shape for a histogram bin. For two-dimensional data, however, bins can take more general shapes (rectangles or hexagons):
- The obvious strategy is to choose a rectangular bin to build a histogram. Imagine the above scatter plot being filled with rectangular boxes (where the boxes represent the bins in the horizontal and vertical directions). Each bin can then be colored with a gradient fill according to the count of values in it.
- The hexagon tiling uses a hexagon shape for binning. The same scatterplot chart can be filled with hexagon shapes and the count of points falling in each hexagon can be used to fill the shape
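The rectangular strategy described above can be sketched without any plotting library: split both axes into equal-width intervals and count the points that fall into each cell. The function name bin_counts_2d and the toy points are assumptions made for this example; a plotting package would then map each count to a color. The sketch assumes each coordinate has at least two distinct values.

```julia
# Illustrative 2D rectangular binning (hypothetical helper, not a Gadfly function).
# Returns a matrix of counts with one row per y bin and one column per x bin.
function bin_counts_2d(x::AbstractVector{<:Real}, y::AbstractVector{<:Real},
                       nx::Integer, ny::Integer)
    xlo, xhi = extrema(x)
    ylo, yhi = extrema(y)
    counts = zeros(Int, ny, nx)
    for (xi, yi) in zip(x, y)
        # Scale each coordinate to a bin index; the maximum goes into the last bin
        col = min(floor(Int, (xi - xlo) / (xhi - xlo) * nx) + 1, nx)
        row = min(floor(Int, (yi - ylo) / (yhi - ylo) * ny) + 1, ny)
        counts[row, col] += 1
    end
    return counts
end

# Made-up points, used only for this example
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 5.0]
ys = [10.0, 20.0, 30.0, 40.0, 50.0, 10.0]
grid_counts = bin_counts_2d(xs, ys, 2, 2)
println(grid_counts)
```

Hexagonal binning follows the same idea; only the shape of the cells and the rule for assigning a point to a cell change.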
Hex plot of mpg and weight of the cars
Using Geom.hexbin from Gadfly to draw the hexbin plot
This is how it is done in seaborn (Python):
#sn.jointplot(x = 'mpg',
# y = 'weight',
# data = autos_df, color = 'k',
# kind = 'hex'
# )
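A possible Gadfly counterpart of the seaborn call above is sketched below. To keep it self-contained it uses synthetic data instead of autos_df; the generated mpg and weight values, the seed, and the bin counts are assumptions chosen only for illustration.

```julia
using Gadfly
using Random

# Synthetic stand-in for the mpg and weight columns (not the real dataset)
Random.seed!(42)
n = 500
weight = 2000 .+ 2500 .* rand(n)
mpg = 45 .- 0.007 .* weight .+ 2 .* randn(n)

# Hexagonal binning: each hexagon is colored by the number of points inside it
hex = plot(x=mpg, y=weight,
           Geom.hexbin(xbincount=30, ybincount=30),
           Guide.xlabel("mpg"),
           Guide.ylabel("weight"),
           Guide.title("Hexbin plot of mpg vs weight (synthetic data)"))
```

With the dataframe from this notebook, the equivalent call would presumably be plot(autos_df, x=:mpg, y=:weight, Geom.hexbin). Unlike seaborn's jointplot, this draws only the hexagonal bins, without the marginal histograms.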
Box plot is a graphical representation of numerical data that can be used to understand the variability of the data and the existence of outliers.
A boxplot is a graph that gives you a good indication of how the values in the data are spread out.
To generate a box plot: Assume data as : 98, 77, 85, 88, 82, 83, 87, 67, 100, 63, 105
- Arrange the data in ascending order: 63, 67, 77, 82, 83, 85, 87, 88, 98, 100, 105
- Calculate the median (the middle value of the data, 85). This is Q2.
- Calculate the median of the first half of the data (77). This is Q1.
- Calculate the median of the second half of the data (98). This is Q3.
- The box joins Q1 to Q3 (contains the middle 50% of the data).
- IQR = Q3 - Q1 = 98 - 77 = 21
- LIF = Q1 - 1.5 * IQR = 45.5; UIF = Q3 + 1.5 * IQR = 129.5
- The point adjacent to LIF is 63 and the point adjacent to UIF is 105.
- The smallest observation greater than or equal to LIF builds lower whisker.
- The largest observation less than or equal to UIF builds upper whisker.
Points outside the fences are outliers.
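The procedure can be reproduced with the Statistics standard library. The sketch below follows the median-of-halves rule described in the steps (for an odd number of points the middle value is left out of both halves); note that this is one of several quartile conventions, and Julia's quantile function uses a different default, so it can return slightly different quartiles. The variable names are chosen for this example only.

```julia
using Statistics

data = sort([98, 77, 85, 88, 82, 83, 87, 67, 100, 63, 105])
n = length(data)
half = n ÷ 2                               # size of each half

q2 = median(data)                          # middle value
q1 = median(data[1:half])                  # median of the lower half
q3 = median(data[end-half+1:end])          # median of the upper half

iqr = q3 - q1                              # interquartile range (length of the box)
lif = q1 - 1.5 * iqr                       # lower inner fence
uif = q3 + 1.5 * iqr                       # upper inner fence

# Whiskers end at the most extreme observations that are still inside the fences
lower_whisker = minimum(filter(>=(lif), data))
upper_whisker = maximum(filter(<=(uif), data))

# Anything beyond the fences is flagged as an outlier
outliers = filter(v -> v < lif || v > uif, data)

println((q1 = q1, q2 = q2, q3 = q3, iqr = iqr, lif = lif, uif = uif))
println((lower_whisker = lower_whisker, upper_whisker = upper_whisker, outliers = outliers))
```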
Interpreting a boxplot:
- If the box is wide and the whiskers are long, then maybe the data doesn't cluster.
- If the box is small and the whiskers are short, then probably the data does indeed cluster.
- If the box is small and the whiskers are long, then maybe the data clusters but has some "outliers".
Gadfly.push_theme(:default)
bp1 = plot(autos_df, x=:cylinders, y=:mpg, color=:s_origin,
Geom.boxplot, Theme(boxplot_spacing=0.1mm),
Guide.title("Boxplot of mpg with #cylinders"),
Scale.color_discrete_manual("skyblue","red","green")
)
What if we want to bring more than 4 dimensions of data into the same plot?
set_default_plot_size(24cm, 20cm)
subgrid1_df = combine(groupby(autos_df,[:s_origin,:cylinders]), :mpg .=> [mean])
gp1 = plot(subgrid1_df,
x=:s_origin,
y=:mpg_mean,
#xgroup=:cylinders, ##it can be any other categorical variable.
ygroup=:cylinders,
color=:s_origin, ## This can be some other categorical variable. Putting origin does not give any info.
Geom.subplot_grid(
layer(
Geom.bar(position=:dodge)
),
#To label the bars.
layer(
label=string.(round.(Int,subgrid1_df.mpg_mean)),
Geom.label(position=:above)#,
#Stat.dodge(position=:dodge)
)),
Guide.title("Avg miles per gallon with cylinders and origin"),
Scale.color_discrete_manual("skyblue","red","green"),
Theme(
bar_spacing=4mm,
key_position=:right)
)
subgrid2_df = combine(groupby(autos_df,[:model_year,:s_origin,:cylinders]), :mpg .=> [mean])
last(subgrid2_df, 6) ## last() replaces the deprecated tail()
gp2 = plot(subgrid2_df,
x=:model_year,
y=:mpg_mean,
#xgroup=:model_year, ##it can be any other categorical variable.
ygroup=:cylinders,
color=:s_origin, ## This can be some other categorical variable. Putting origin does not give any info.
Geom.subplot_grid(
layer(
Geom.line()
),
layer(
label=string.(round.(Int,subgrid2_df.mpg_mean)),
Geom.label(position=:dynamic, hide_overlaps=true)
)),
Guide.title("Avg miles per gallon with cylinders and origin"),
Scale.color_discrete_manual("skyblue","red","green"),
Scale.x_continuous(;minvalue=65, maxvalue=80),
Theme(panel_stroke=colorant"black",
grid_line_width=0mm) ## we can use style() or theme to pass these parameters.
)
More on setting themes: http://gadflyjl.org/v0.7/man/themes.html#style-1
Draw a boxplot using Geom.subplot_grid to split the boxplot with ygroup = s_origin. Refer to the Geom.subplot_grid used for the bar plot in the section above.
Gadfly provides geometric objects for fancier plots like
and many more....